{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Needle-in-Haystack Benchmark Notebook\n", "\n", "This notebook measures the performance of a “needle in a haystack” lookup query—finding one specific row among many—for three storage backends: SQLite, MongoDB, and ParquetDB.\n", "\n", "---\n", "\n", "## Benchmark Details\n", "\n", "- **Data Generation:** \n", " - Each run generates *N* rows × 100 columns of integers, with exactly one row set to `–1` (“the needle”) and all other entries random in [0, 1 000 000]. \n", " - Row counts tested: `1, 10, 100, 1 000, 10 000, 100 000, 1 000 000`. \n", " - Integers chosen as a basic primitive type—byte size is the main factor, so these results represent a **lower bound** on query time. More complex or larger types will incur higher cost.\n", "\n", "- **Parquet Normalization Settings (defaults):** \n", " - **Row‐group size:** 50 000–100 000 rows per group \n", " - **Max rows per file:** 10 000 000 \n", " - Tuning these can shift performance between inserts, reads, and updates.\n", "\n", "---\n", "\n", "## System Specifications\n", "\n", "- **Operating System:** Windows 10 \n", "- **Processor:** AMD Ryzen 7 3700X 8‑Core @ 3.6 MHz (8 cores, 16 logical processors) \n", "- **Memory:** 128 GB DDR4‑3600 MHz (4×32 GB DIMMs) \n", "- **Storage**: SATA HDD 2TB (Model: ST2000DM008-2FR102)\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install parquetdb\n", "!pip install pymongo" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import time\n", "import random\n", "import shutil\n", "import sqlite3\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import matplotlib.ticker as ticker\n", "from mpl_toolkits.axes_grid1.inset_locator import inset_axes\n", "from pymongo import MongoClient\n", "import pyarrow.compute as pc\n", "\n", "from parquetdb import ParquetDB, config\n", "\n", "\n", "bench_dir = os.path.join(config.data_dir, \"benchmarks\")\n", "sqlite_dir = os.path.join(bench_dir, \"sqlite\")\n", "mongo_dir = os.path.join(bench_dir, \"mongodb\")\n", "pq_dir = os.path.join(bench_dir, \"parquetdb\")\n", "for d in (sqlite_dir, mongo_dir, pq_dir):\n", " os.makedirs(d, exist_ok=True)\n", "\n", "row_counts = [1, 10, 100, 1_000, 10_000, 100_000, 1_000_000]\n", "n_cols = 100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. SQLite Becnhmark" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Helper functions" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def generate_data_sqlite(n_rows, n_cols=100, needle=-1):\n", " idx = random.randrange(n_rows)\n", " data = []\n", " for i in range(n_rows):\n", " vals = [needle]*n_cols if i == idx else [random.randint(0,1_000_000) for _ in range(n_cols)]\n", " data.append(tuple(vals))\n", " return data\n", "\n", "def run_sqlite(n_rows, use_index):\n", " db_file = os.path.join(sqlite_dir, f\"bench_{n_rows}_{'idx' if use_index else 'noidx'}.db\")\n", " if os.path.exists(db_file): os.remove(db_file)\n", "\n", " # Insert\n", " start = time.time()\n", " conn = sqlite3.connect(db_file)\n", " cols = \", \".join(f\"col{i} INTEGER\" for i in range(n_cols))\n", " conn.execute(f\"CREATE TABLE t ({cols})\")\n", " conn.execute(\"PRAGMA synchronous = OFF\")\n", " conn.execute(\"PRAGMA journal_mode = MEMORY\")\n", " placeholders = \", \".join(\"?\" for _ in range(n_cols))\n", " conn.executemany(f\"INSERT INTO t VALUES ({placeholders})\", generate_data_sqlite(n_rows, n_cols))\n", " conn.commit()\n", " conn.close()\n", " insert_time = time.time() - start\n", "\n", " # Optional index\n", " if use_index:\n", " conn = sqlite3.connect(db_file)\n", " conn.execute(\"CREATE INDEX idx_col0 ON t(col0)\")\n", " conn.commit()\n", " conn.close()\n", "\n", " # Query\n", " start = time.time()\n", " conn = sqlite3.connect(db_file)\n", " cur = conn.cursor()\n", " cur.execute(\"SELECT col0 FROM t WHERE col0 = -1\")\n", " assert cur.fetchone()[0] == -1\n", " conn.close()\n", " read_time = time.time() - start\n", "\n", " return insert_time, read_time" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 Run without Index" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "results_sqlite = {\"n_rows\": [], \"insert_noidx\": [], \"read_noidx\": []}\n", "for n in row_counts:\n", " it, rt = run_sqlite(n, use_index=False)\n", " results_sqlite[\"n_rows\"].append(n)\n", " results_sqlite[\"insert_noidx\"].append(it)\n", " results_sqlite[\"read_noidx\"].append(rt)\n", "df_sqlite_noidx = pd.DataFrame(results_sqlite)\n", "df_sqlite_noidx.to_csv(os.path.join(sqlite_dir, \"sqlite_noidx.csv\"), index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3 Run with Index" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "results_sqlite_idx = {\"n_rows\": [], \"insert_idx\": [], \"read_idx\": []}\n", "for n in row_counts:\n", " it, rt = run_sqlite(n, use_index=True)\n", " results_sqlite_idx[\"n_rows\"].append(n)\n", " results_sqlite_idx[\"insert_idx\"].append(it)\n", " results_sqlite_idx[\"read_idx\"].append(rt)\n", "df_sqlite_idx = pd.DataFrame(results_sqlite_idx)\n", "df_sqlite_idx.to_csv(os.path.join(sqlite_dir, \"sqlite_idx.csv\"), index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. MongoDB Benchmark" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1 Helper functions" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def generate_data_mongo(n_rows, n_cols=100, needle=-1):\n", " idx = random.randrange(n_rows)\n", " docs = []\n", " for i in range(n_rows):\n", " base = {f\"col_{j}\": needle for j in range(n_cols)} if i == idx else \\\n", " {f\"col_{j}\": random.randint(0,1_000_000) for j in range(n_cols)}\n", " docs.append(base)\n", " return docs\n", "\n", "def run_mongo(n_rows, use_index):\n", " client = MongoClient(\"mongodb://localhost:27017/\")\n", " db = client.benchmark\n", " coll = db.t\n", " client.drop_database(\"benchmark\")\n", "\n", " # Insert\n", " start = time.time()\n", " coll.insert_many(generate_data_mongo(n_rows, n_cols))\n", " insert_time = time.time() - start\n", "\n", " # Index?\n", " if use_index:\n", " coll.create_index(\"col_0\")\n", "\n", " # Query\n", " start = time.time()\n", " res = list(coll.find({\"col_0\": -1}, {\"col_0\":1, \"_id\":0}))\n", " assert res[0][\"col_0\"] == -1\n", " read_time = time.time() - start\n", " client.close()\n", "\n", " return insert_time, read_time" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Run without Index" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "results_mongo_noidx = {\"n_rows\": [], \"insert_noidx\": [], \"read_noidx\": []}\n", "for n in row_counts:\n", " it, rt = run_mongo(n, use_index=False)\n", " results_mongo_noidx[\"n_rows\"].append(n)\n", " results_mongo_noidx[\"insert_noidx\"].append(it)\n", " results_mongo_noidx[\"read_noidx\"].append(rt)\n", "df_mongo_noidx = pd.DataFrame(results_mongo_noidx)\n", "df_mongo_noidx.to_csv(os.path.join(mongo_dir, \"mongo_noidx.csv\"), index=False)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| \n", " | n_rows | \n", "insert_noidx | \n", "read_noidx | \n", "
|---|---|---|---|
| 0 | \n", "1 | \n", "0.018000 | \n", "0.000000 | \n", "
| 1 | \n", "10 | \n", "0.012573 | \n", "0.000000 | \n", "
| 2 | \n", "100 | \n", "0.023000 | \n", "0.000000 | \n", "
| 3 | \n", "1000 | \n", "0.139594 | \n", "0.002002 | \n", "
| 4 | \n", "10000 | \n", "1.114362 | \n", "0.007001 | \n", "